
[pull] master from tensorflow:master #1688

Merged
pull[bot] merged 39 commits into GesuBackups:master from tensorflow:master
Apr 3, 2026
Conversation

@pull pull Bot commented Apr 2, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

codeXsidd and others added 22 commits March 21, 2026 10:38
This CL is part 1 of cl/886092985. It lands the logic for negation (-v0 instead of + -1 * v0) and simplifies map output by omitting empty symbol brackets []. It also deprecates many AffineExpr/Map methods in indexing_map_serialization. Landing this minimizes the number of test updates we have to do in the following CLs.
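
The two printing rules mentioned above can be illustrated with a minimal sketch. It is written in Python rather than XLA's C++, and `format_term` and `format_map` are hypothetical helpers made up for this illustration; they are not XLA's serialization API, and the exact output format of the real printer may differ.

```python
def format_term(coeff, var):
    """Format a single coefficient * variable term.

    Hypothetical helper: a coefficient of -1 prints as a plain negation
    ("-v0") instead of an explicit multiplication ("-1 * v0").
    """
    if coeff == 1:
        return var
    if coeff == -1:
        return f"-{var}"
    return f"{coeff} * {var}"


def format_map(dims, symbols, results):
    """Format a map header, omitting the symbol brackets when there are no symbols."""
    header = f"({', '.join(dims)})"
    if symbols:  # only print "[...]" when at least one symbol exists
        header += f"[{', '.join(symbols)}]"
    return f"{header} -> ({', '.join(results)})"
```

For example, `format_map(["d0", "d1"], [], [format_term(-1, "d0")])` produces a header with no empty `[]` and a result printed as a negation.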

PiperOrigin-RevId: 893497910
…s to avoid instruction cache thrashing.

PiperOrigin-RevId: 893502849
…hine.

This change introduces a static factory method `TargetMachineOptions::Native()` which automatically populates the target triple, CPU name, and CPU features based on the host machine's characteristics using LLVM's host detection utilities. A test is added to ensure the inferred options match LLVM's host CPU information.

PiperOrigin-RevId: 893503067
PiperOrigin-RevId: 893503478
…cessing.

This was relied on for using local wheels as overrides for matching
requirements defined in lock files, and will allow us to stop using the purely
suggestive `--find-links` that was needed for the 1.8.4 upgrade because of
the regression that's fixed in 1.8.5.\
`pkg @ wheel URL` will be used instead for local wheels again.

The 1.8.5 upgrade is a tiny regression fix-only upgrade and thus doesn't require any other adjustments:\
https://rules-python.readthedocs.io/en/latest/changelog.html#v1-8-5

More fix context:\
openxla/xla@11a2044
openxla/xla@0025bf7

PiperOrigin-RevId: 893537944
When the user is not requesting a specific CPU architecture,
XLA automatically detects the host architecture and passes this
information to FFI-based custom calls and XLA:CPU. So far,
this detection has been ignoring architecture features
(like SSE, Neon, AVX, etc.).

So this change adds the missing HW feature detection and also
updates the embedded system configs. It also adds a test
that ensures the embedded system configs are in sync with
the actual systems.

The feature detection takes `DebugOptions::xla_cpu_max_isa` into
account which allows the user to limit the feature set to an
older generation of CPU to make the binary more portable.
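
As a rough illustration of capping a detected feature set at an older ISA generation, here is a minimal Python sketch. The ordering in `ISA_ORDER`, the feature names, and the `limit_features` helper are all assumptions made for this example; they do not reflect how XLA represents `xla_cpu_max_isa` or host features internally.

```python
# Hypothetical ISA generations, oldest to newest (illustrative, not XLA's list).
ISA_ORDER = ["sse4.2", "avx", "avx2", "avx512f"]


def limit_features(host_features, max_isa=None):
    """Return the host features, dropping ISA generations newer than max_isa.

    Features that are not listed in ISA_ORDER pass through unchanged in
    this sketch. With max_isa=None, nothing is filtered.
    """
    if max_isa is None:
        return list(host_features)
    cap = ISA_ORDER.index(max_isa)
    kept = []
    for feature in host_features:
        if feature in ISA_ORDER and ISA_ORDER.index(feature) > cap:
            continue  # newer than the requested cap, so leave it out
        kept.append(feature)
    return kept
```

With a cap of `"avx2"`, a host reporting `avx512f` would compile as if that feature were absent, which is the portability tradeoff the message describes.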

PiperOrigin-RevId: 893544830
Imported from GitHub PR openxla/xla#40311

XLA:GPU switched to structured concurrency with `AsyncStartThunk` and `AsyncDoneThunk`, so this change removes a failed experiment with `WaitForStreamsThunk`.
Copybara import of the project:

--
303f683519bd73d80b474570cfe63953ccd83e7d by Eugene Zhulenev <ezhulenev@openxla.org>:

[xla:gpu] Delete vestigial WaitForStreamsThunk

Merging this change closes #40311

PiperOrigin-RevId: 893588227
PiperOrigin-RevId: 893592092
There's no reason to execute the generated HLO. This test exercises a utility
that generates a specific sequence of XlaOps. This change refactors the test
such that we're just comparing the generated HLO against an expected HLO.

The expected HLO was obtained from printing the module generated by the old test.

PiperOrigin-RevId: 893592373
…further investigation.

Reverts 7fc14e3

PiperOrigin-RevId: 893594429
Reverts c6d844d

PiperOrigin-RevId: 893602331
… module does not have a name

PiperOrigin-RevId: 893611161
PiperOrigin-RevId: 893615727
…ologyDescription from proto

PiperOrigin-RevId: 893622566
Imported from GitHub PR openxla/xla#40117

This PR updates the XLA Linux x86 GPU oneAPI presubmit coverage by expanding the test scope from //xla/stream_executor/sycl/... and //xla/service/gpu/... to the broader set //xla/..., //build_tools/..., and @tsl//tsl/..., ensuring more comprehensive validation.

Accordingly, it switches from _XLA_ONEAPI_TARGET_PATTERNS to _XLA_DEFAULT_TARGET_PATTERNS to align oneAPI presubmit checks with the default XLA test coverage.
Copybara import of the project:

--
50e4447d7e31ffd9f71fa20805a5b330089b77da by mraunak <mayank.kumar.raunak@intel.com>:

Update build.py
--
eb896f9d6386638e8150e815b18ecd0f88235dca by mraunak <mayank.kumar.raunak@intel.com>:

Update golden_commands.txt

Merging this change closes #40117

PiperOrigin-RevId: 893639525
@pull pull Bot locked and limited conversation to collaborators Apr 2, 2026
@pull pull Bot added the ⤵️ pull label Apr 2, 2026
ekayaaslan and others added 6 commits April 2, 2026 12:42
With this change, the Shardy outliner translates the named computations into separate calls, leaving a flat call graph.

PiperOrigin-RevId: 893645729
- Moved TransposePlanCache and its mutex from PjRtStreamExecutorClient, PjRtCpuClient, and TpuClient to CommonPjRtClient.
- Added GetTransposePlan to CommonPjRtClient for thread-safe access.
- Updated call sites to use the new centralized interface.
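
The refactoring pattern in the list above (one cache and one lock owned by a shared base class, with a single thread-safe accessor) can be sketched as follows. This is a Python analogy under stated assumptions, not the PjRt C++ code: `CommonClient`, `CpuClient`, `get_transpose_plan`, and `_build_plan` are hypothetical stand-ins.

```python
import threading


class CommonClient:
    """Hypothetical base class that owns one plan cache for all subclasses."""

    def __init__(self):
        self._plan_cache = {}
        self._plan_cache_lock = threading.Lock()

    def get_transpose_plan(self, dims, permutation):
        """Return a cached plan, building it under the lock on first use."""
        key = (tuple(dims), tuple(permutation))
        with self._plan_cache_lock:
            plan = self._plan_cache.get(key)
            if plan is None:
                plan = self._build_plan(dims, permutation)
                self._plan_cache[key] = plan
            return plan

    def _build_plan(self, dims, permutation):
        # Stand-in for real plan construction: just the permuted shape.
        return tuple(dims[i] for i in permutation)


class CpuClient(CommonClient):
    """Subclasses no longer carry their own cache or mutex."""
```

Callers only see `get_transpose_plan`, so the locking discipline lives in one place instead of being repeated per client type.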

PiperOrigin-RevId: 893657648
…alAsync`

This fixes a bug where PjRt CPU buffers ignore major-to-minor in the layout inside `ToLiteral`. The CL also makes `CopyToLiteralAsync` perform the work asynchronously as intended by the API.
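
To see why ignoring the dimension order in a layout produces wrong data, consider how a multi-dimensional index maps to a linear offset. The sketch below is a simplified Python illustration, assuming a dense layout described by a `minor_to_major` list of dimension numbers; `linear_offset` is a hypothetical helper, not PjRt code.

```python
def linear_offset(index, dims, minor_to_major):
    """Compute the element offset of `index` in a dense buffer.

    `minor_to_major` lists dimension numbers from the fastest-varying
    (most minor) to the slowest-varying (most major).
    """
    offset = 0
    stride = 1
    for dim in minor_to_major:
        offset += index[dim] * stride
        stride *= dims[dim]
    return offset
```

For a 2x3 array, element `(1, 0)` sits at offset 3 when dimension 1 is most minor (`[1, 0]`) but at offset 1 when dimension 0 is most minor (`[0, 1]`), so reading a buffer with the wrong order returns a different element.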

PiperOrigin-RevId: 893674302
…des on all memory kinds

PiperOrigin-RevId: 893691249
seantalts and others added 11 commits April 2, 2026 14:16
This flag preset will continue to be developed with fast compilation times and numerical stability as the top goals (runtime performance is only a secondary goal). Expect tradeoffs similar to XX% compilation time for X% runtime to occur under this flag.

Currently, it just sets LLVM codegen opt to O1, disables platform dependent math, and sets `flatten_after_fusion` to true.

PiperOrigin-RevId: 893693608
…ucket.table.

Also, clean up includes.

Benchmarks are slightly negative, which is expected because the benchmark doesn't cover the high-contention/very-large-buckets motivating case. Note that the benchmarks are still faster than when we used tsl::Hash64.
```
name              cpu/op       cpu/op     vs base
BM_SendRecv       93.38n ± 2%   99.09n ± 1%  +6.12% (p=0.000 n=20)
BM_RecvSend       76.73n ± 1%   83.00n ± 1%  +8.18% (p=0.000 n=20)
BM_PingPong/100   308.9µ ± 2%   311.7µ ± 2%       ~ (p=0.841 n=20)
BM_PingPong/200   612.4µ ± 3%   614.2µ ± 2%       ~ (p=0.799 n=20)
BM_PingPong/300   929.6µ ± 3%   932.4µ ± 3%       ~ (p=0.968 n=20)
geomean           16.60µ        17.11µ       +3.11%

name              time/op       time/op     vs base
BM_SendRecv       93.59n ± 2%   99.32n ± 1%  +6.12% (p=0.000 n=20)
BM_RecvSend       76.89n ± 1%   83.19n ± 1%  +8.19% (p=0.000 n=20)
BM_PingPong/100   704.2µ ± 1%   693.8µ ± 3%       ~ (p=0.086 n=20)
BM_PingPong/200   1.434m ± 3%   1.393m ± 4%       ~ (p=0.201 n=20)
BM_PingPong/300   2.158m ± 2%   2.120m ± 2%       ~ (p=0.265 n=20)
geomean           27.49µ        27.91µ       +1.53%

name              INSTRUCTIONS/op  INSTRUCTIONS/op  vs base
BM_SendRecv       1.053k ± 0%       1.229k ± 0%  +16.71% (p=0.000 n=20)
BM_RecvSend        833.2 ± 0%       1008.2 ± 0%  +21.00% (p=0.000 n=20)
BM_PingPong/100   539.0k ± 0%       576.2k ± 0%   +6.90% (p=0.000 n=20)
BM_PingPong/200   1.024M ± 0%       1.098M ± 0%   +7.29% (p=0.000 n=20)
BM_PingPong/300   1.507M ± 0%       1.621M ± 0%   +7.55% (p=0.000 n=20)
geomean           59.24k            66.20k       +11.74%

name              CYCLES/op    CYCLES/op   vs base
BM_SendRecv        328.7 ± 2%    348.4 ± 1%  +6.00% (p=0.000 n=20)
BM_RecvSend        269.9 ± 1%    292.0 ± 1%  +8.21% (p=0.000 n=20)
BM_PingPong/100   649.2k ± 1%   650.8k ± 1%       ~ (p=0.841 n=20)
BM_PingPong/200   1.279M ± 1%   1.281M ± 2%       ~ (p=0.968 n=20)
BM_PingPong/300   1.917M ± 1%   1.926M ± 1%       ~ (p=0.369 n=20)
geomean           42.65k        43.92k       +2.97%

name              items/s      items/s    vs base
BM_PingPong/100   323.8k ± 2%   320.8k ± 2%       ~ (p=0.841 n=20)
BM_PingPong/200   326.6k ± 3%   325.6k ± 2%       ~ (p=0.799 n=20)
BM_PingPong/300   322.7k ± 2%   321.7k ± 2%       ~ (p=0.968 n=20)
geomean           324.3k        322.7k       -0.50%
```

PiperOrigin-RevId: 893694334
Main goal is to not include Eigen when all we need is error codes.

PiperOrigin-RevId: 893722480
`xla::ifrt::Value::ByteSize()` is a new API that asks the IFRT runtime to compute the byte size of the IFRT value object (an array or an upcoming tuple). This API will provide the user with a fast and accurate way to calculate on-device sizes of IFRT value objects, and the runtime would be responsible for providing a robust implementation of this calculation.

Since `xla::ifrt::Array` is a subclass of `xla::ifrt::Value`, `xla::ifrt::Array::ByteSize()` can be used without casting `xla::ifrt::ArrayRef` to `xla::ifrt::ValueRef`.

The initial implementation of this method uses `Layout::ByteSize()` or `PjRtLayout::ByteSize()` to compute it on the fly. The implementation currently does not cache or precompute it.
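
For intuition only, an on-the-fly byte-size computation for a dense, unpadded array might look like the Python sketch below. It assumes the size is simply the element count times the element width; real layouts (for example, tiled or padded ones) can differ, and `dense_byte_size` and `ELEMENT_BYTES` are hypothetical names, not part of IFRT or PjRt.

```python
import math

# Illustrative subset of element widths in bytes.
ELEMENT_BYTES = {"f32": 4, "bf16": 2, "s8": 1}


def dense_byte_size(dims, dtype):
    """Byte size of a dense array with no padding: element count * element width."""
    return math.prod(dims) * ELEMENT_BYTES[dtype]
```

Nothing is cached here; each call recomputes the product, which mirrors the "computed on the fly" behavior described above.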

PiperOrigin-RevId: 893768289
Previously, JAX and StableHLO did not have asynchronous collectives. Thus,
every JAX program lowered, via StableHLO, to an HLO program without
asynchronous collectives.

Recently, we added asynchronous collectives to JAX and StableHLO, but the XLA
CPU backend doesn't support asynchronous collectives. If you try to run a JAX
program with asynchronous collectives on CPU, it will crash.

We can solve this in two ways. (1) We could do nothing and let the program
crash. (2) We could replace the asynchronous collectives with synchronous
collectives.

This CL implements option 2. The XLA CPU backend now immediately replaces any
asynchronous collectives with their synchronous counterparts.
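
Option 2 can be pictured as a small rewrite over an instruction list. The following Python sketch uses a toy `Op` record and assumes that asynchronous collectives appear as `<name>-start` / `<name>-done` pairs in topological order; it is an illustration of the idea, not the XLA CPU pass.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    opcode: str
    operands: tuple = ()


def make_collectives_sync(ops):
    """Replace each start/done pair with one synchronous op.

    The "-start" op becomes the synchronous collective, the "-done" op is
    dropped, and users of the "-done" op are rewired to the new op.
    """
    renamed = {}  # name of a dropped "-done" op -> name of its replacement
    result = []
    for op in ops:
        operands = tuple(renamed.get(o, o) for o in op.operands)
        if op.opcode.endswith("-start"):
            sync_opcode = op.opcode[: -len("-start")]
            result.append(Op(op.name, sync_opcode, operands))
        elif op.opcode.endswith("-done"):
            renamed[op.name] = operands[0]  # forward users to the sync op
        else:
            result.append(Op(op.name, op.opcode, operands))
    return result
```

Given `parameter`, `all-reduce-start`, `all-reduce-done`, and an `add` that consumes the done op, the rewrite yields `parameter`, `all-reduce`, and an `add` that consumes the synchronous `all-reduce` directly.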

PiperOrigin-RevId: 893768474
Updates LLVM usage to match
[7ccd92e5e6e5](llvm/llvm-project@7ccd92e5e6e5)

PiperOrigin-RevId: 893768833
@pull pull Bot merged commit 6ec44a0 into GesuBackups:master Apr 3, 2026